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Abstract 

Quadtree  representation  of  matrices  offers  a  homogeneous  representation  for  both 
sparse  and  dense  matrices,  with  advantages  for  processing  on  multiprocessors.  This  paper 
offers  exact  values  for  the  average  depth  and  on  the  number  of  nodes  in  this  representation 
of  some  familiar  patterned  matrices:  symmetric,  triangular,  and  banded.  It  similarly 
measures  three  permutation  matrices  as  comparative  examples  of  non-dense,  unpatterned 
matrices.  Those  results  are  exact  values  for  the  shuffle  and  bit-reversal  permutations 
raised  by  the  fast  Fourier  transform,  as  well  as  tight  bounds  on  the  expected  values  from 
purely  random  permutations.  Two  different  measures  for  density  and  for  sparsity  are 
proposed  from  these  values,  and  a  simple  analysis  of  quadtree  matrix  addition  is  given  as 
an  illustration  of  these  measures. 
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Recent  papers  [11,  12]  have  proposed  a  homogeneous  quadtree  representation  for 
both  dense  and  sparse  matrices,  which  provides  a  graceful  decomposition  of  algorithms 
suitable  for  scheduling  on  multiprocessors.  Most  of  the  discussion  has  been  directed  at 
the  algorithms,  themselves,  focusing  on  the  importance  for  decomposing  n  x  n  matrices 
gracefully  onto  p  processors  for  p«n.  Some  of  these  algorithms  are  outlined  at  the  end 
of  the  next  section. 

All  the  results  there  apply,  as  well,  to  sparse  matrix  techniques,  if  we  can  only  show 
that  the  quadtree  representation  is  acceptably  sparse.  That  is  the  goal  of  this  paper. 
Section  2  defines  the  quadtree  representation  of  matrices,  structuring  dense  and  sparse 
matrices  indistinguishably.  This  homogeneous  representation  comes  at  some  price,  because 
above  all  the  scalar  entries  in  a  matrix  is  a  large  tree  of  nonterminal  nodes,  which  is 
supplanted  by  addressing  hardware  in  the  usual,  sequential  Von  Neumann  memory. 

Although  nontrivial  in  comparison  with  constant-time  access  into  sequentially  stored 
matrices,  the  additional  overhead  from  these  nonterminal  nodes  is  an  artificial  concern  in  a 
couple  of  ways.  First  of  all,  it  may  be  irrelevant  in  the  algorithms  described  below,  which 
typically  involve  recursive  descent.  That  is,  rather  than  accessing  elements  of  a  matrix 
from  the  root  of  the  tree  (analogously  to  indexing  through  a  conventional  array  based  from 
a  single  memory  address),  these  algorithms  recurse  to  nested  and  successively  shallower 
subtrees,  so  that  an  entire  path  from  the  root  is  rarely  traversed  just  to  manipulate  a  single 
element. 

Secondly,  even  if  the  complete  path  were  traversed  upon  every  probe  of  an  array,  the 
time  spent  to  traverse  that  path  might  be  recovered  in  other  ways.  More  specifically,  if  a 
heap  were  spread  across  very  many  memory  banks  and  several  processors  were  performing 
similar  algorithms  on  one  globally  accessible  quadtree-matrix  (whose  nodes  mapped  across 
those  banks  in  a  random  order),  then  the  aggregate  performa  ice  of  those  processors  might 
actually  improve  over  correspondingly  similar  algorithms  on  matrices  stored  sequentially. 
The  improvement  arises  from  the  random  pattern  mitigating  the  problem  of  two,  or  more, 
processors  falling  into  the  same  access  pattern  and  addressing  the  same  banks  simulta¬ 
neously  and  repeatedly.  The  coincidence  of  regular  access  patterns  to  regularly  allocated 
arrays,  even  from  regular  offsets  within  different  matrices,  is  likely  to  become  an  ever  in¬ 
creasing  problem  with  more  processors.  Randomization  available  from  this  kind  of  heap 
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[4]  would  not  prevent  the  first  contention  between  two  algorithms,  but  it  would  certainly 
help  prevent  a  first  from  being  immediately  followed  by  more.  Thus,  the  tree  structure 
would  allow  many  coprocessors  to  run  with  less  memory  contention,  and  to  absorb  the 
cost  of  repeated  path  traversals. 

Both  efficiency  of  access  and  sparse  matrices  are  of  high  interest  in  parallel  processing. 
Many  caching  strategies  fail  under  parallelism,  and  bus  or  switch  contention  compounds 
the  problem  of  wait  states.  Many  problems  sufficiently  large  and  important  enough  to 
justify  parallel  computation  also  exhibit  sparseness.  Therefore,  these  results  are  of  great 
interest  in  designing  parallel  algorithms  and  parallel  computers  [2,  8], 

It  is,  therefore,  important  to  measure  the  costs  of  large  matrices  in  the  quadtree  for¬ 
mat.  The  next  eight  sections  give  some  answers  for  familiar  matrices  [9];  many  of  them  are 
characterized  by  strong  patterning.  Section  2  offers  the  definitions  and  normal  forms  for 
quadtree  representation;  it  also  indicates  how  some  parallel  algorithms  are  fitted  to  this 
representation.  The  next  section  presents  the  expected  path  and  exact  space  measures  for 
dense  and  symmetric  matrices.  Section  4  treats  triangular,  “clip,”  and  finally  banded  ma¬ 
trices,  the  most  familiar  of  sparse  matrices.  The  “shuffle*  and  “bit-reversal*  permutations, 
are  encountered  when  studying  the  fast  Fourier  transform,  are  analyzed  in  Section  5;  the 
latter,  here  called  the  FFT  permutation ,  is  the  worst  case  for  these  measures.  Section  6 
gives  tight  bounds  on  random  permutation  matrices,  and  shows  them  to  behave  nearly  as 
badly  as  the  worst  case.  Based  on  the  foregoing  analyses,  we  propose  measures  for  sparsity 
and  density  in  Section  7.  As  an  example  of  their  use,  Section  8  offers  a  worst-case  analysis 
of  the  performance  of  matrix  addition.  The  last  section  offers  conclusions. 

Section  2.  Quadtree  Representation  and  Algorithms 

Dimension  refers  to  the  number  of  subscripts  on  an  array.  Order  of  a  square  matrix 
means  the  number  of  its  rows  or  columns  when  written  as  the  conventional  tableau.  Simi¬ 
larly,  the  size  of  a  vector  is  the  number  of  elements  when  it  is  expressed  in  the  conventional 
tuple  formulation. 

Let  any  d-dimensional  array  be  represented  as  a  2d-ary  tree.  Here  only  matrices  and 
vectors  are  considered,  where  d  =  2  suggests  quadtrees,  and  d  =  1  suggests  binary  trees. 
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Matrix  algorithms  will  be  arranged  so  that  we  may  (without  loss)  perceive  any  nonzero 
scalar,  x,  as  a  diagonal  matrix  of  arbitrary  order,  entirely  of  zeroes  except  for  x’s  on  the 
main  diagonal;  that  is,  x  =  [i(5t>y].  (This  particular  compression,  however,  is  ignored  in  the 
analyses  of  this  paper.)  Thus,  a  domain  is  postulated  that  coalesces  scalars  and  matrices, 
with  every  scalar-like  object  conforming  also  as  a  matrix  of  any  order.  Of  particular  interest 
is  the  scalar  1,  which  is  at  once  the  unique  multiplicative  identity  for  scalar/ matrix  arith¬ 
metic.  The  additive  identity,  0,  is  represented  by  the  null  pointer,  NIL  (using  PASCAL 
notation),  which  is  particularly  helpful  in  reducing  dereferencing  necessary  in  manipulating 
sparse  matrices. 

A  matrix  (of  otherwise-known  order)  is  either  a  ‘scalar’  or  it  is  a  quadruple  of  four 
equally-ordered  submatrices.  So  that  this  recursive  cleaving  works  smoothly,  we  embed  a 
matrix  of  size  rtxnina  2f,gnl  x  2^,gnl  matrix,  justified  at  the  lower,  right  (southeast) 
corner  with  zero  padding  to  the  north  and  west.  Padding  with  NIL  minimizes  the  space 
consumed  in  padding.  The  matrix  is  justified  to  the  southeast,  rather  than  the  northwest, 
so  to  help  with  computation  of  eliminants  [l]. 

This  prescribes  a  normal  form  for  quadtrees:  no  scalar  entry  is  ever  0,  four  quadrants 
cannot  all  be  NIL,  and  if  the  southwest  and  northeast  are  NIL  then  the  northwest  and 
southeast  cannot  be  the  same  scalar.  Similarly,  NIL  as  a  vector  refers  to  the  zero  vector, 
and  any  non-zero  scalar  x  is  interpreted  as  a  vector  of  arbitrary  size,  each  of  whose  elements 
is  x;  this  normal  form  for  vectors  precludes  any  entry  from  being  0  and  any  brothers  from 
both  being  NIL  or  the  same  scalar. 

Inferring  the  conventional  meaning  from  such  a  matrix  now  requires  additional  in¬ 
formation  (viz.  its  order),  but  we  can  proceed  quite  far  without  size  information;  it  only 
becomes  critical  upon  Input  or  Output.  One  must  acknowledge  that  the  I/O  conversions 
are  non-trivial  algorithms  [12],  but  because  they  consume  little  processor  resource — and 
are  restrained,  also,  by  communication  bandwidth — we  eschew  them  here.  Like  floating¬ 
point  number  conversions,  they  are  an  irritating  impediment  to  one  who  would  experiment 
with  the  algorithms  discussed  below. 

A  “header”  above  each  matrix  quadtree  should  contain  its  size,  necessary  for  output 
translation  and  needed  for  better  control  of  certain  algorithms,  like  pivoting  on  singular 
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matrices.  Often,  however,  its  value  is  not  essential  to  binary  operators,  especially  if  con- 
formability  checks  are  unnecessary  at  run  time.  Here,  size  is  the  proper  length  of  the  main 
diagonal — exclusive  of  any  padding  and  normalization. 

An  optional  annotation  is  to  include  a  bit  within  each  pointer,  indicating  that  the 
referenced  tree  structure  is  to  be  interpreted  as  transposed,  recursively  interchanging 
southwest\northeast  quadrants  upon  any  access.  Call  this  the  transposed  flag.  With  it, 
quadtree  representation  allow  transposition  of  an  entire  matrix  in  constant  time — at  the 
cost  of  building  a  new  reference  with  that  flag  inverted.  It  also  indicates  how  row  and 
column  traversal  use  the  same  algorithm:  a  symmetric-order  traversal  of  the  appropriately 
projected  binary  tree.  However,  unless  hardware  support  is  available,  testing  of  this  flag 
slows  dereferencing  of  every  quadrant.  Because  of  this  cost,  is  used  in  this  paper  only  in 
comparing  measures  for  the  explicit  class  of  symmetric  matrices. 

Tree-structured  memory  is  well  suited  to  functional  or  applicative  programming  style, 
whose  only  control  structure  is  function  invocation  and  whose  only  iteration  is  expressed 
through  recursion.  The  recursive  definition  of  a  quaternary  trees  mimics  the  recursive 
structure  of  programs  that  manipulate  them. 

Thus,  the  algorithm  for  matrix  addition  jll]  decomposes  naturally  into  four  quad¬ 
rant  additions  which  are  separate  and  independent  processes.  Because  of  their  mutual 
independence,  these  four  are  naturally  computed  in  parallel  within  a  shared  memory,  or 
distributed  to  independent  processors  with  private  memory.  In  the  latter  case,  the  tree 
structure  of  the  matrix  may  be  better  mapped  onto  a  tree  of  private  memories.  And  the 
division  extends  naturally  to  16,  64,  256,  etc.  processors,  or — by  splitting  the  sums  in  half, 
rather  than  in  quarters — to  2,  8,  128  etc.  as  well.  When  either  addend  is  NIL,  addition 
returns  a  shared  reference  to  the  other  addend. 

The  problem  of  matrix  multiplication  may  be  decomposed  two  ways  (again  treating 
the  product  as  two  halves),  four  ways  (the  four  quadrants  of  the  answer),  and  eight  ways 
(the  eight  quadrant  products  in  Strassen’s  decomposition  [10]  of  Gaussian  matrix  multi¬ 
plication.)  Whenever  a  multiplier  or  multiplicand  is  NIL,  the  product  is  annihilated  to 
NIL;  if  either  is  1  the  product  becomes  a  reference  to  the  other. 

Solutions  to  linear  systems  and  matrix  inversion  have  been  reduced  to  the  Pivot  Step 
algorithm  [5],  where  the  “independent”  problem  of  a  stable  choice  for  the  pivot  element 
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folds  naturally  onto  the  tree.  The  quadtree  not  only  suggests  efficient  tree-search,  but 
also  it  provides  the  structure  upon  which  to  bind  decorations  (local  maxima  in  a  sparse 
matrix)  that  do  not  change  across  many  pivot  steps,  thereby  truncating  the  full  search  for 
successive  pivots. 

The  remainder  of  this  paper  addresses  static  measures  of  quadtree  representation,  only 
briefly  considering  any  of  the  algorithms  mentioned  above.  Some  of  the  efficiencies  provided 
by  normal  form,  moreover,  are  ignored,  because  this  paper  only  considers  efficiencies  due 
to  the  occurrences  of  NIL  in  sparse  matrices.  It  does  not  analyze,  for  instance,  space 
efficiencies  due  to  the  representation  of  a  16  x  16  identity  submatrix  simply  as  the  scalar  1. 
Moreover,  even  though  implicitly  shared  references  are  likely  to  result  from  many  programs, 
none  of  the  analyses  that  follow  consider  any  sharing  beyond  that  explicitly  stated  and 
indicated  in  the  figures.  All  these  measures  are,  therefore,  slightly  conservative. 

Section  3.  Dense  and  Symmetric  Matrices 

A  matrix  is  dense  if  it  contains  no  zero  entries.  It  is  symmetric  either  if  it  is  a  non¬ 
zero  scalar,  or  if  its  northwest  and  southeast  quadrants  are  symmetric  and  its  northeast 
and  southwest  pointers  are  transposes  of  one-another;  they  share  a  reference  in  the  tree 
representation,  except  for  inverting  the  “transposed”  flag  on  one.  These  shared  quadrants 
are  presumed  to  be  dense  in  the  analysis  that  follows.  Both  these  decompositions  are 
illustrated  in  Figure  1. 

Let  n  =  2P,  for  p  an  integer,  and  define  the  functions  Sd,  mapping  p  to  the  number 
of  nodes  {space)  necessary  to  represent  a  dense  n  x  n  matrix,  and  Pd,  mapping  p  to  the 
average  path  length  in  a  dense  n  x  n  matrix.  Similarly,  maps  p  to  the  space  necessary 
to  represent  a  symmetric  n  x  n  matrix,  and  Ps  maps  p  to  the  average  path  length  in  a 
symmetric  n  x  n  matrix. 

Then 


Sc(0)  =  1; 

s5(0)  =  1; 

Sd{p  +  1)  =  1  +  4  Sx>(p); 

S's(p+  1)  =  1  +  2  Ss[p)  +  SD{p). 
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In  the  last  case,  the  southwest  and  northeast  quadrants  are  dense  and  coincident,  and, 
together,  they  only  contribute  to  the  space  cost  but  once.  Figure  1  shows  this  sharing, 
which  requires  the  use  of  the  optional  /it  transposed  flag.  Thence, 


sd(p) = y*]  4*  = 
*=o 


4P+1  _  i 
3 


=  =  £(»+!>(—  *>• 

SsM  =  -  jE2'  +  T  E I  =  §(<”  - 1)  +  2p 

i"=0  »=0  *=0 

=  K»2  -  1)  +  «  =  I(n  +  2)(n  -  i). 


In  traditional  representations  using  sequentiality  of  memory  addresses,  we  expect  these 
space  requirements  to  be  n2  and  (n2  +  n)/2,  respectively,  although  additional  information 
is  needed  for  a  program  to  attain  the  second. 

Assuming  (simplistically)  that  scalars  and  nonterminal  nodes  all  take  the  same  space, 
we  immediately  observe  a  33%  space  increase  over  sequential  storage  of  matrices.  In  fact, 
it  is  likely  that  the  nonterminal  nodes  (of  four  pointers)  are  larger  than  terminal  nodes  (a 
single  scalar) ,  indicating  an  additional  consideration — the  constant  of  proportionality — to 
be  considered  when  comparing  different  representations.  This  space  difference  is  probably 
closer  to  66%  for  that  reason. 

The  expense  to  access  an  individual  element  is  a  less  critical  measure,  because  most 
algorithms  do  not  perform  that  operation  as  a  primitive  fetch  from  the  root  of  the  tree. 
Rather,  the  problem  decomposes,  so  that  the  fetch  is  issued  only  locally,  from  control 
points  immediately  above  the  terminal  (scalar)  nodes.  However,  it  is  a  cost  of  interest  for 
conventional  algorithms. 


Pd{  0)  =  1; 

Pd{p  +  l)  =  1  +  Pd{p)  —  P  +  2. 

Ps{p)  =  Pd{p)  =  p  +  1  =  lgn  +  1. 

Section  4.  Triangular,  Clip,  and  Banded  Matrices 

A  triangular  matrix  either  is  a  non-zero  scalar  or  it  decomposes  into  quadrants,  of 
which  the  northwest  and  southeast  are  triangular,  the  northeast  is  zero,  and  the  southwest 
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is  dense  (or  vice  versa,  but  consistently).  The  parenthetical  postscript  here  is  intended 
to  provide  for  both  “upper”  and  “lower”  triangular,  and  this  recursive  definition  is  to  be 
unfolded  consistently  in  order  to  preserve  one  of  those  ‘shapes.’  If  interpreted  freely,  it  also 
allows  other  ‘folded’  shapes,  and  the  following  analysis  would  still  apply  to  them.  Again, 
Figure  1  illustrates  this  definition. 

A  “clip”  matrix  (Figure  2)  looks  similar  to  a  triangular  matrix,  with  the  role  of  the 
main  diagonal  replaced  by  one  closer  to  a  corner.  A  measure  of  this  distance  is  needed, 
characterized  by  bandwidth,  b,  where  (b  <  n),  a  measure  of  the  width  of  the  non- zero 
portion  of  the  matrix.  In  this  paper,  b  will  always  be  2W  for  some  integer  tv  <  p  =  lg  n. 
When  b  =  1,  a  clip  matrix  has  a  non-zero  entry  only  in  the  extreme  southwest  or  extreme 
northeast  entry  (exclusively).  When  b  =  n  it  is  a  triangular  matrix. 

An  n  x  n  clip  matrix  of  bandwidth  b  is  either  a  triangular  matrix  with  n  =  6,  or  b  <  n 
and  it  has  four  quadrants,  of  which  three,  including  the  northwest  and  southeast,  are  zero 
and  the  remaining  quadrant  is  a  clip  matrix  of  bandwidth  b.  Again,  the  choice  of  northeast 
or  southwest  for  the  clip  is  intended  to  be  made  consistently  over  any  unfolding  of  this 
recursive  definition;  otherwise,  strange  foldings  arise,  but  their  analyses  remain  correct. 

Intuitively,  a  banded  matrix  is  zero  far  away  from  the  main  diagonal,  but  relatively 
dense  in  a  band  of  width  b  from  the  diagonal.  The  bandwidth  of  a  matrix,  [a,,/],  is  defined 
to  be 

max  {|t  —  j|  |  aitj  ±  O}. 

This  measure  is  just  under  half  that  of  a  common  alternative  that  is  used  to  give  these 
forms  their  Greek  names,  the  full  width  of  the  nonzero  “band”  on  either  side  of — and 
including — the  main  diagonal. 

An  n  x  n  banded  matrix  of  bandwidth  6,  for  b  <  n,  is  either  a  dense  matrix  with 
b  =  n,  or  has  northwest  and  southeast  quadrants  being  banded  matrices,  and  northeast 
and  southwest  quadrants  being  clip  matrices,  all  of  bandwidth  b  when  b  <  n.  Once  again,  it 
is  intended  that  the  clipping  be  in  a  consistent  pattern  similar  to  Figure  2,  but  even  if  this 
shape  is  violated,  the  following  analyses  still  hold.  (Dense  matrices  may  be  characterized 
to  be  banded  in  two  ways:  either  with  6  =  n  or  with  6  =  n  —  1.  Since  we  here  restrict  n 
and  6  to  powers  of  two,  however,  this  confusing  redundancy  only  arises  with  n  =  2  and 
either  b  =  2  or  b  =  1.) 
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As  before,  define  St  mapping  p  to  the  number  of  nodes  (space)  necessary  to  represent 
a  triangular  n  x  n  matrix,  and  Pt  that  maps  p  to  the  average  path  length  in  a  triangular 
n  x  n  matrix.  (By  convention  p  =  lgn  and  w  =  lgi.)  Similarly,  Sc  maps  the  pair,  (p,  w) , 
to  the  space  necessary  to  represent  an  n  x  n  clip  matrix  of  bandwidth  6,  and  Pc  maps 
( p,w )  to  the  average  path  length  in  an  n  x  n  clip  matrix  of  bandwidth  b.  Sb  maps  the 
pair,  (p,  w) ,  to  the  space  necessary  to  represent  aiuxn  banded  matrix  of  bandwidth  6, 
and  Pb  maps  (p,  w)  to  the  average  path  length  in  an  n  x  n  banded  matrix  of  bandwidth  b. 

For  triangular  matrices, 


St(p)  =  Ss(p ) 

Pt(0)  =  1; 

Pt(p+  1)  =  1  +  \Pd(p)  +  \Pt(p )  +  jO. 


At  last,  we  see  the  effects  of  avseraging  paths  across  sparsely  represented  entries!  The 
path  through  a  non-trivial  triangular  matrix  certainly  passes  through  a  root  node,  and 
then  into  one  of  the  four  quadrants  with  equal  probability  of  25%  .  Solving  [7,  Eqn.  2.75], 


i’r(p)  =  E2-i  +  ?'f  2-  +  lg*'2i 

t=0  0  0 


=  !(p  +  3-2-P)=!(lgn  +  3-i). 

As  expected,  the  path  length  averages  to  be  half  the  depth  of  the  quadtree,  because  an 
isolated  half  of  the  tree  is  NIL. 


For  clip  matrices, 

Pc(p,P )  =  Pt[p)’, 

Pc(p+  l,w)  =  1  +  \Pc(p,w)- 
Sc(p,p)  =  ST(p ); 

Sc(p+  l,w)  =  1  +  Sc(p,rv). 


Pc(p)  -  |  +  (I  +  w  ~  2  w] 

=  t  +  i(*)*li +  *»-«■ 

Sc(p)  =  p  —  w  +  1(4*"  —  1)  +  2W 

=  |(62-l)+&  +  lg(£). 


Solving  these,  we  have 


Banded  matrices  are  composed  of  clip  quadrants  and  dense  quadrants  as  shown  in 
Figure  2. 

Pb{p,p)  =  Pd{p); 

Pb{p+  1,u>)  =  1  +  \Pb{p,w)  +  |  Pc{p,w). 

Sb{p,p)  =  SD(p); 

Sb{p+  l,tw)  =  1  +  2  Sb(p,u>)  +  2Sc{p,w). 

The  general  solutions  are  more  complicated: 

PB(p,w)  =  f  +  £[2(w  -  1)  +  (2-p  -  2-)  -  £(tv  +  I)] 

=  T  +i[2(lg6-l)  +  (i-i)-i(lg6  +  J)]; 

SB  (p,  w)  =  |(2P+U,+1  +  2P"W  -  22w)  +  2[(2P  -  2W)  -  (p  -  u>)]  -  | 

=  |(2n6  -  b2  +  £  -  1)  +  2[»  -  b  -  lg(f )]  - 

These  are  most  interesting  only  when  we  consider  particular  values  of  w,  to  reveal  space 
and  average  path  for  specific  bandings,  e.g.  tridiagonal  matrices: 

‘S’b(PiO)  =  6n  -  21gn  -  5, 


pentadiagonal  matrices: 


_  ,  10  3  2 

Pb  (Pi  0)  —  _  d"  -  2  > 
3  n  3  n 


Sb{p,  1)  =  8n  -  2lgn  -  9, 

P  f  -x  _  10  1  10 

Pb{Pi  1)  —  3  n  3n2, 


and  enneadiagonal  matrices: 


<Sb(p,  2)  =  13n  —  2lgn  —  26, 

„  ,  10  7  100 

Pb(p'  2)  =  T  +  n“5^' 

Alternatively,  one  might  define  bandwidth,  b,  of  the  form  b  =  2V  -  1  for  integers  v, 
rather  than  b  =  2W.  This  allows  verification  of  the  above  results  on  tridiagonal  matrices 
(w  =  0;b  =  1),  and  fills  in  the  following  values  for  heptadiagonal  matrices  ( b  =  3): 

space  =  13n  -  2lgn  —  26, 

10  7  100 

avgpath  =  T  +  -  - 


Thus,  a  tridiagonal  matrix  that  requires  space  of  at  least  3n  —  2  cells  in  sequential 
storage  requires  just  under  twice  that  as  a  quadtree,  although  that  proportion  falls  as  the 
bandwidth  increases.  However,  expected  depth,  a  reflection  of  access  cost,  stays  remarkably 
constant  over  various  bandwidths. 

Section  5.  Shuffle  and  FFT  Permutation  Matrices 

Because  any  permutation  matrix  can  be  represented  as  a  vector  of  integers  (regardless 
of  whether  a  vector  is  represented  using  sequential  memory  or  as  a  binary  tree) ,  it  seems 
wasteful  even  to  consider  them  expanded  in  a  normal  matrix  representation.  They  are 
studied  here  because  they  initially  seem  to  be  really  sparse,  but  turn  out  to  be  surpris¬ 
ingly  expensive.  Permutation  matrices,  particularly  those  associated  with  the  fast  Fourier 
transform  (FFT)  algorithm,  have  little  patterning  of  zeroes  that  would  allow  a  collapse 
of  the  quadtree  representation.  Since,  as  argued  elsewhere  [3],  patterning  is  necessary  to 
sparseness,  it  will  be  interesting  to  see  how  these  measures  of  space  and  path  length  will 
compare  with  those  of  the  highly  patterned  matrices,  already  presented. 

Because  we  are  more  interested  in  the  patterning  of  zeroes  in  the  permutation  matrices 
than  in  any  space  compression  possible  from  sharing  within  them,  all  space  analyses  will 
ignore  the  possibility  of  sharing  submatrices.  Let  a  permutation-patterned  matrix  be  any 
n  x  n  real  matrix  with  n  non-zero  entries,  each  row  or  column  of  which  contains  exactly 
n  —  1  zero  entries.  It  is  then  reasonable  to  presume  that  each  non-zero  entry  is  unique  in 
such  a  matrix,  so  that  no  space  savings  from  sharing  of  submatrices  is  possible. 

Two  permutations,  Dp  and  Sp  occur  naturally  in  developing  the  FFT,  which  are  here 
called  deal  and  shuffle ,  respectively  [13].  Multiplying  a  vector  of  size  n  =  2P  by  Dp  on  the 
left  will  have  the  effect  of  reordering  its  elements  as  if  the  vector  were  a  deck  of  cards,  which 
was  dealt  into  two  full  hands  of  n/2  cards  which  were  then  stacked.  A  typical  layout  of  Dp 
appears  below.  It  is  characterized  by  ones  appearing  in  “knight’s  moves,”  first  descending 
from  the  far  northwest  entry  to  central-east,  and  also  ascending  from  the  far  southeast 
entry  to  central-west. 
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Sp  is  the  inverse  of  Dp\  Pease  [6]  names  Sp  as  P,  without  specifying  its  order.  Multiplying  a 
vector  on  the  left  by  Sp  has  the  effect  of  performing  a  perfect  riffle-shuffle  on  the  elements  of 
the  vector.  Multiplying  on  the  right  by  these  permutations  reorders  the  columns  similarly. 

Figure  2  illustrates  how  Dp  can  be  restructured  into  Dp+i  for  p  >  0.  D0  =  1  trivially, 
by  duplicating  and  restructuring  its  quadrants.  This  observation  leads  to  the  following 
two  recurrences,  expressed  using  the  notation,  SD ,  standing  for  both  “shuffle”  and  “deal,” 
distinguished  from  “symmetric”  or  “dense.”  For  the  space  of  a  shuffle/deal  permutation- 
patterned  matrix: 

Ssd(0)  =  1; 

Ssd(1)  =  3; 

Ssd  (p  +  1)  =  5  +  2(Ssd  (p)  -  1)  • 

The  recurrence  for  path  length  is  similar.  Again,  Figure  2  shows  how 

Psd(0)  =  l; 

fto(i)  =  }; 

Psd{p  +  1)  =  2  +  l(PsD(p,w)  -  1). 

Solving  these  recurrences,  we  find  that  for  p  >  0 

p-3 

Ssd (p)  =  3(  Y,  2‘)  +  2p-2S'sd(2)  =  3(2P  -  1)  =  3 (n  -  1); 

i=0 

•Psd(p)  =  3(1  —  2-p)  =  3(1  —  — ). 

n 

If  sharing  were  allowed,  then  Ssd(p)  would  collapse  to  4p  —  2  and  2*  x  2*  shuffle/deal 
matrices  for  for  all  integers  i  <  p  could  be  represented  in  shared  space  5p  —  3. 


The  FFT  permutation  (p-bit-reversal) ,  arises  as  a  consequence  of  Pease’s  recurrence 
[6],  given  explicitly  by  Wise  [13]  who  calls  it  Fp.  A  key  step  is  to  factor  out  Sp  (or  Dp)  at  the 
left  (respectively,  right)  end  at  every  step  in  the  recurrence;  by  associating  all  these  factors 
to  the  left  (right)  exclusively  of  other  arithmetic,  one  of  the  following  two  factorizations 
arises. 


Fp  is  called  “bit-reversal”  permutation  because  it  exchanges  xt-  and  Xb(i)  in  permut¬ 
ing  x,  where  b  is  a  function  on  natural  numbers  less  than  2P  that  reverses  the  p-bit 
strings  that  represent  them.  That  is,  the  element  of  x  indexed  i  =  £2y=o  by  •  2J  is  in¬ 
dexed  o  bn_y  •  2J  after  that  permutation.  (This  fact  can  be  established  by  a  simple 
induction  on  p  from  either  factorization  above.)  Since  b  is  its  own  inverse,  Fp  is  a  symmet¬ 
ric  permutation  matrix,  and  so  the  space  measure  for  Fp  that  follows  might  be  considerably 
reduced  by  allowing  for  sharing. 

If  one  considers  an  n  x  n  FFT  permutation-patterned  matrix  where  n  =  4k,  then 
the  42k  entries  can  be  perceived  as  a  2fc  x  2k  matrix  of  blocks,  each  of  which  is  a  2k  x  2k 
matrices.  Each  of  those  blocks  has  exactly  one  element  that  is  non-zero.  Thus,  its  quadtree 
is  gridded  as  illustrated  in  Figure  3. 

On  drawing  the  block  decomposition  described  above  as  a  tree,  we  find  that  the 
quaternary  tree  is  complete  for  the  first  k  levels,  down  to  the  4k  intermediate  notes  that 
are  associated  with  those  blocks.  The  tree  rooted  at  each  intermediate  node  is  metalinear; 
that  is,  each  node  has  at  most  one  son,  along  the  path  toward  the  unique  1  entry.  All 
other  pointers  in  those  subtrees  are  NIL.  Figure  3  also  illustrates  this  tree. 

Using  the  notation  from  the  previous  sections,  let  Sp  map  p  onto  the  space  required 
to  represent  an  FFT  permutation-patterned  matrix,  and  Pp  map  p  onto  the  expected  path 


length  in  an  FFT  permutation  matrix.  We  have  already  constrained  n  =  4*,  so  p  =  2k. 
Then 


Sp(0)  =  1; 

Sp{p  +  2)  =  1  +  4<Sjp(p)  +  2P+2. 

The  last  equation  arises  by  adding  a  root  above  four  FFT  permutation-patterned  matrices, 
and  extending  each  metalinear  chain  by  one  level  at  the  bottom.  Similarly, 


Pf{  0)  =  l; 

Pf{p  +  2)  =  1  +  Pf(p)  + 


2P+2 

4P+2' 


Solving  these  two  recurrences, 


Sf{p)  = 
Pf[p  +  2)  = 


nlgn  An  1 

2  ~3  3  ’ 

lgn  4  _  _1_ 

2  3  3n* 


The  results  show  that  the  space  needed  for  an  n  x  n  FFT  permutation-patterned  matrix 
grows  as  0(nlgn),  significantly  more  than  the  linear  space  we  get  with  the  vector-of- 
indices  representation!  The  average  path  length,  too,  is  surpxisingly  poor,  essentially 
reflecting  the  completeness  of  the  tree  to  that  level,  but  already  more  than  half  that  of  a 
dense  matrix! 


Theorem  1.  The  FFT  permutation  exhibits  the  worst  case  measures  of  any  equivalently- 
sized  permutation-patterned  matrix  with  respect  to  both  space  and  path  length. 

Proof  is  by  an  induction  like  those  above.  The  worst  case  for  space  requires  that  each 
terminal  scalar  be  isolated  in  its  own  subtree,  with  minimum  sharing  of  its  path  with  any 
of  its  cousins;  more  sharing  would  reduce  total  space.  Inspection  of  the  tree  in  Figure  3 
shows  that  the  most  space  is  used  by  completing  the  shared  tree  as  high  in  the  tree  as 
possible,  and  hanging  n  independent  paths,  each  of  length,  (.gn)/2,  beneath  .  The  worst 
case  for  average  path  length  occurs  when  all  terminal  scalars  occur  at  depth  lg  n,  as  they 
do  in  the  FFT  permutation-pattern,  even  when  the  tree  for  [x£t)y]  can  be  represented  as 


Section  6.  Random  Permutations 

In  this  section  we  find  tight  bounds  for  the  average  time  and  space  required  by  quadtree 
representation  of  random  permutation-patterned  matrices.  Define  P(p)  to  be  a  n  x  n 
permutation  matrix  where  n  =  2P  and  p  is  an  integer.  As  before,  we  define  Sp  and 
Pp  that  map  p  to  the  average  number  of  nodes  (space)  necessary  to  represent  an  n  x  n 
permutation-patterned  matrix  and  the  average  path  length  in  an  n  x  n  permutation  matrix, 
respectively,  where  n  =  2P,  p  an  integer.  We  also  define  Tp>{p)  to  be  a  complete  quadtree 
obtained  for  a  2P  x  2P  matrix  and  Tp(p)  to  be  the  quadtree  that  results  from  representing 
a  given  2P  x  2P  permutation  matrix  P(p). 

Theorem  2.  If  p  is  even 

-  2P  (p,  r,/*  I  tn  )+2pft  rife)  4 

-I  +  2-PV2  1  +  2-"  s-(l +  *-')*/  \3  /  3 

and 

SP(P)>2”(  f  + 

If  p  is  odd 

<  2,1  /P  .  1  .  3-P/2  +  l/I  ,  V  .p  1  4 

-l  +  2-»V2  2  l  +  2-l>  T  3(1  +  2 "»)»;  \3  2  )  3 

and 

Sp(p)  >  2'  (|  +  .768)  -  i. 

Proof:  We  only  analyze  the  case  p  is  even.  The  case  p  is  odd  may  be  established  in 
similar  fashion.  Associate  with  each  node  in  Tp(p)  a  pair  of  integers  (t,  j)  where  t  specifies 
the  level  of  the  node  in  To(p)  and  j  is  the  left  to  right  order  of  the  node  at  that  level. 
Let  Np(i,j)  be  an  indicator  variable  which  has  value  1  if  node  (»',  j)  exists  in  Tp(p)  and 
0  otherwise.  The  expectation  of  Np(i,j)  is  the  probability  that  node  (*,  j)  is  in  Tp{p). 
Then, 

Sp{p )  =  E{NP{iJ))  =  ^  pr(node  («,  j)  is  in  TP(p)).  (1) 

i,:  «'»i 

But  node  (i,j)  is  in  Tp(p)  if  and  only  if  the  submatrix  corresponding  to  that  node  is  not 
zero.  Let  pr(p,  t)  denote  the  probability  that  the  submatrix  corresponding  to  a  node  at 
level  i  is  zero.  Therefore,  the  probability  that  node  (t,  j)  is  in  Tp(p)  is  1  -  pr(p,i). 
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Figure  4  illustrates  the  calculation  of  pr(p,  *').  The  2p-‘  x2p_*  submatrix  corresponding 
to  node  (*',/)  has  been  placed  at  the  lower  right  comer  of  matrix  P  for  simplicity;  this 
matrix  must  be  entirely  zero.  In  order  for  P  to  be  a  permutation  matrix  exactly  2P_*  rows 
of  submatrix  B  must  contain  l’s.  This  requires  2p-t  columns  of  sub  matrix  A  to  be  all 
zero.  There  are  (2P^*  )  ways  to  choose  2p-t  zero  columns  in  A.  For  each  way  to  choose 
2P“*  zero  columns  in  A  there  are  (2P  -  2p— *)!  ways  to  place  l’s  in  submatrices  A  and  C 
such  that  no  two  are  in  the  same  row  or  column.  Finally,  for  each  way  to  place  l’s  in  A 
and  C  there  are  2P-M  ways  to  place  l’s  in  B  (these  must  be  placed  in  the  zero  columns  of 
A).  Thus,  for  a  node  at  level  t, 


pr(p,i)  = 


(2P~Vi  ‘)2p-*'!(2p  -  2P— *)!  (2P  -  2p“‘)!(2p  -  2P"‘)! 

2p!  —  (2p  -  2  -2p_t)!2p! 

(2P  -  2P“*  -  0)  (2p  -  2p-<  -  1)  (2P  -  2P~{  -  2)  (2P  -  2P~<+1  +  1) 

(2p  —  0)  ’  (2p  —  1)  '  (2P  —  2)  (2p-2p~*  +  l) 


2p-*  \ 

2  P-k)' 


(2) 


Using  (2),  the  fact  that  the  number  of  nodes  on  Level  *  is  4’,  and  that  the  node  on  Level 
0  is  always  present,  we  can  rewrite  (l)  as  follows: 


Sp(p)  =  1  +  ^  4* 

»=i 


2p-t 

2P  -  k 


(3) 


We  now  split  the  sum  in  (3)  into  two  subsums  by  cleaving  the  range  of  k  and  derive 
upper  and  lower  bounds  for  each.  These  will  later  be  added  to  establish  the  theorem. 

a.  Consider  the  sum 


i=p/  2+1  \  k= 0  v  7 

<  E  - — — ) 

“  .  1  V  2p  —  2p-‘  +  1/ 

*=p/2+l  V  V  7 


2P— 


From  Lemma  1  of  the  appendix,  2P  *  >  1  implies 


2p 


2  p“*  \aP" 

—  2p~*  -hi/ 


>  1  - 


23(p-0 

2p  -  2p"‘  +  1 ' 
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Therefore,  we  can  bound  U pper  from  above  as  follows: 


Upper<  ±  4‘(  4P"  W  ±  _ I 

2 p  1 

"  l  +  1- 


+  2~p 


»=p/2+l  1+2-p 

Making  use  of  the  fact  that  1/(1  —  x)  <  1  +  x  +  2x2  if  0  <  x  <  1/2  we  can  write 


2P  /  2“*'  2-2"2*  \ 
PP€f  "  1  +  2-p  ,=^+1  V  +  1  +  2-p  +  (1  +  2-p)2  ) 

^  2P  fp  2"P/2  2-2“P  \ 

“  1  +  2-p  \2  +  1  +  2-p  +  3  •  (1  +  2-p)2  )  ‘ 


Upper  is  bounded  from  below  by 


t,pper- ,_fe.  4'f1‘(1"^) 


t=p/2+l 


From  Lemma  2  of  the  appendix,  2P  *  >  1  implies 


(■-?) 


-»\  2*~' 


22(p— 0  24(p-‘) 

—  1  2P  22p+1 


Therefore, 


Upper  >  £  4*  (2p-2i  —  22P--‘~1)  >  ^ 


»=p/2+l 


b.  Now  consider  the  sum 


P/*  I  2p— -1  ,  p_i  v 

£“Mr=£4‘(1'  S  ('-d 

*£<‘(‘-(>-*^£71) 


Lemma  3  of  the  appendix  asserts  that  (1  —  x)v  >  e  xv/U  *)  for  all  positive  x  and  y. 
Therefore, 

p/2  p/2-1  ,  „  x  2»’-<  /  \ 


P£2  P/iT1  /  J  \2p-  / 

LowerK^-  £  Wl--— ) 

t=l  »=1  \  /  \  / 

2p+2  4  /  ap/3  \ 

SV'3-2P(' 
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Lower  is  bounded  from  below  by 


p/  .  /  /  2p_‘\ 
Lower  >  I  1  -  (  1  -  — J 


Lemma  4  of  the  appendix  asserts  that  (1  —  x)v*  <  e  x  v  for  all  positive  z  and  y.  So 

P/2  /  „  -  \  P/2-3 


Lower  >  £*  -  2-  («-  +  i«-  +  -  £  4*. 

2P+2  4 

>  — - -  -  .38  •  2P. 

~  3  3 

Putting  the  bounds  of  Cases  a  and  b  together  proves  the  theorem.  | 
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Corollary  1.  For  n  a  large  power  of  2,  the  average  space  required  fornxn  permutation 
patterned  matrices  is  between  (nlgn)/2  +  .768n  —  4/3  and  (nlgn)/2  +  .983n  —  4/3. 


Proof:  From  Theorem  2  the  average  space  gets  trapped  between  the  lower  bound  of 
2p(p/2+.768)— 4/3  and  2p(p/2+.783)-4/3  and  the  upper  limit  of  2p(p/2+4/3-e~1)-4/3 
and  2p(p/2  +  7/6  —  e~x /2)  —  4/3  .  The  result  follows.  | 


Theorem  3.  If  p  is  even,  then 


Pp(p)  < 


-2^i_  1  (1  2  2"p  \  , 

ap/3-3  -| - -  + - )  +  1 

l  +  2-p  \3  71t2"p/ 


and 


v  2-p 

Pp(p)  >1  +  -862  -  — 


If  p  is  odd,  then 


o  1  __ail±I!ZL_ 
Pp(p)<?+o-«  + 


1  +  2-P 


2  2_Eri  \ 

3  +  7(1  +  2-p)  J 


and 

Pp  (P)  >  |  +  -784  - 
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Proof:  Again  we  extend  only  the  case  where  p  is  even  here;  the  case  for  odd  p  is  similar. 
A  path  contains  one  node  from  each  of  levels  0,1,2...  in  Ir>(p)  until  a  node  at  some  level, 
say  *,  representing  an  all  zero  submatrix  is  reached.  Therefore, 

p 

PP(p)  =  ^pr (random  path  is  at  least  i  in  length). 

*=0 

But,  a  path  is  at  least  *  in  length  if  and  only  if  the  level  i  submatrix  is  not  zero.  The 
probability  that  the  level  *  submatrix  is  zero  was  found  in  the  previous  proof.  Using  that 
probability  we  have 

As  before,  we  break  the  sum  of  (4)  into  two  subsums  so  that  the  upper  limit  of  the  first 
subsum  is  p/2  and  the  lower  limit  of  the  second  subsum  is  p/2  +  1  and  find  bounds  for 
each  subsum.  Proceeding  as  in  the  previous  theorem  we  find  that  the  second  subsum  is 
bounded  from  above  by 


Upper  <  — V''  *— ■  ;■ — 

<_?i-  ± 

—  1  +  2“*’  “  V  1  +  2-rJ 


»=p/2+l 


<  — — —  ( ~2~p  +  z  (~ 

~  1  +  2-P  \3  7  VI 

The  second  subsum  is  bounded  from  below  by 


4-p 


+  2 -p 


))• 


Upper  >  ^2  (2p~2i  -  22p_4‘)  =  2P  4“*  -  4P  ^  16_i 

*=p/2+l  *=p/2  +  l  «=p/2+l 

>- - —  -  — . 

“  3  3  •  2?  15 

The  first  subsum  is  bounded  from  above: 

P/2  /  /  J  \  2* _,\  P/2  /  ,p-,  \  qp/3 

Lower  <  £  I  1  -  ^1  -  ?  j  I  <£^l-«  —  )<§-« 

The  first  subsum  is  bounded  from  below: 

P/2 


Lower 


>£(i >?-<- -2..- 


»=1 


Assembling  all  the  bounds  proves  the  theorem.  | 
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Corollary  2.  For  n  a  large  power  of  2  the  average  path  length  in  an  n  x  n  permutation 
matrices  is  between  |lg(n)  +  .763  and  |lg(n)  +  .966. 

Proof:  From  Theorem  3  the  average  path  length  tends  to  between  a  lower  bound  of 
p/2  +  .862  and  p/2  +  .764  and  an  upper  bound  of  p/2  +  4/3  —  e~1  and  p/2  +  7/6  —  e-1. 
The  result  follows  from  n  =  2P.  1 

The  bounds  in  both  corollaries  are  tight  with  respect  to  their  leading  coefficients  of  5, 
which  coincide  with  those  from  the  FFT  permutation.  Their  second  coefficients  lie  between 
55%  and  75%  of  the  corresponding  coefficients,  both  from  the  FFT  permutation.  While 
looser,  the  two  sets  of  bounds  for  those  second  coefficients  bracket  the  same  range,  which 
is  strikingly  centered  on  the  value  |. 

Section  7.  Measures  of  Sparsity  and  Density 

Duff  states  in  his  authoritative  survey,  “In  quantitative  terms,  the  density  of  a  matrix 
is  defined  as  the  percentage  of  the  number  of  nonzeros  to  the  total  number  of  entries  in  the 
matrix.  The  term  sparsity  for  the  complement  of  this  quantity  is  rarely  used.  [3,  p.  500]” 
Rather,  he  suggests  that  sparsity  of  a  matrix  has  as  much  to  do  with  the  distribution  of 
zero  elements  as  with  their  relative  population. 

We  now  have  seen  closed-form  results  for  total-space  and  for  expected-depth  for  var¬ 
ious  patterned  and  and  permutation-patterned  matrices.  These  results  are  summarized 
below  in  Table  1.  His  perspective  is  reflected  in  these  results  for  the  space  and  access-path 
for  familiar  kinds  of  matrices  represented  as  quadtrees. 

Based  on  Duff’s  caveat  and  these  numbers,  we  present  [12, 13]  measures  of  both  density 
and  sparsity  that  are  motivated  by  results  on  quadtrees,  but  are  defined  independently  of 
any  particular  representation.  Their  values  also  appear  in  Table  1. 

Density  of  a  particular  matrix  is  the  ratio  between  the  space  it  occupies,  and  the  space 
occupied  by  a  dense  matrix  of  the  same  order.  Non-sparsity  of  a  particular  matrix  is  the 
ratio  between  the  expected  time  to  access  a  random  element  (path  length  for  quadtrees), 
and  the  expected  access  time  within  a  dense  matrix  of  the  same  order.  Sparsity  is  the 
difference  between  one  and  this  non-sparsity  measure. 

Both  density  and  sparsity  are  measured  on  a  scale  from  zero  to  one.  For  the  conven¬ 
tional  row-major,  sequential  representation  of  matrices,  the  density  measure  corresponds 
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precisely  with  Duff’s.  The  sparsity  measure  is  uniformly  near  zero  there,  consistent  with 
the  observation  that  this  representation  offers  no  special  advantage  for  sparse  matrix  ma¬ 
nipulation. 

Let  us  consider  n  X  n  matrices.  Table  1  presents  closed-form  and  asymptotic  results  for 
space,  density,  expected  path  length  (root  to  terminal  node),  and  sparsity  for  the  classes  of 
matrices  treated  above.  In  all  cases,  a  matrix  is  presumed  to  be  completely  dense,  except 
where  the  indicated  pattern  requires  zeros  (or  shared  storage  in  the  case  of  symmetry). 


Space 

Density* 

Expected  Path 

Sparsity* 

Dense 

f (»2  -  i) 

1 

lgn  +  1 

0 

Symmetric 

|(n  +  2)(n-I) 

1 

2 

lgn  + 1 

0 

Triangular 

|(n  +  2)(n-I) 

1 

2 

lgn  i  3  1 

2  _r  2  2n 

1  _  l 

2  lgn 

FFT  permutation 

nlgn  ,4 n  1 

2  '3  3 

3  lg  n 

8  n 

It”.  -L± _ L 

2  '3  3n 

1  _  0.83 

2  Tgn 

Random  permutation 

+  0.87n  -  | 

3  lg  it 

8  n 

^  +  0.87 

1  _  0-37 

2  lgn 

Tridiagonal 

6n  —  2lg  it  —  5 

0 

10  3,2 

3  n  '  37PT 

1  3.33 

1  “  i?rr 

Pentadiagonal 

8n  —  21g  n  —  9 

0 

10  1  10 

3  n  3n^ 

3.33 

1_  TiTT 

Heptadiagonal 

lln  —  2lgn  —  19 

0 

10  i  5  76 

3  n  3ns 

i  3.33 

1  “  TgTT 

Enneadiagonal 

13n  —  2lgn  —  26 

0 

10  ,  7  100 

T+  n  ~ 

■,  3.33 

1  —  TgiT 

Shuffle  permutation 

3(n  -  1) 

0 

3(1  -  k) 

l  3 

1  — 

Identity 

1 

0 

1 

i  l 

1  Tg"n 

Table  1.  Measures  of  patterned  and  unpatterned  matrices  as  quadtrees. 

*  Density  is  accurate  within  a  term  of  Q(n_1).  Sparsity  is  accurate  within  a  term  of 

0((lg»)“2)- 


The  remarkable  entry  in  the  table  is  for  random  and  FFT  permutations-patterned, 
which  measure  out  to  be  neither  dense  nor  sparse,  in  spite  of  the  fact  that  they  only 
contains  n  non-zeroes  of  n2  entries.  This  is  consistent  with  Duff’s  observation  that  pat¬ 
terning  is  essential  to  sparseness;  the  bit-reversal  permutation  is  characterized  by  its  lack 
of  local  patterning!  It  is  fortunate  that  we  already  prefer  an  alternative  representation  for 
permutation  matrices  that  is  not  dense:  as  vectors  of  integers. 
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Section  8.  Analysis  of  Matrix  Addition 

The  utility  of  any  measure  must  be  demonstrated  in  analytical  or  in  experimental 
studies  of  the  behavior  of  algorithms  on  arguments  with  various  measures.  This  section 
offers  an  example  worst-case  analysis  of  the  matrix  addition,  the  simplest  matrix  algirithm. 
The  measure  of  sparsity  has  not  been  well  studied,  however,  and  more  work  should  be  done. 

Theorem  4.  The  sum  of  two  n  x  n  addends,  respectively  of  density  d\  and  d2  and  of 
sparsity  s\  and  s2,  requires  uniprocessor  (proportional)  time  and  space  within  the  bounds: 

n2(lgn  +  l)  max[0, 1  —  (sx  +  S2)]  =  uniprocessor  time  =  n2(lgn  +  l)[l  —  max(sx, 52))], 

£»2|di  -d2\<  spacegum  <  |n2min(l,d1  +  d2); 
and,  itself  has  sparsity  and  density  measures  within  the  bounds, 

max(0,  sx  +  s2  -  1)  <  sparsity Bum  <  1  -  |sx  -  s2|; 

jdi  -  d2|  <  densityBUxn  <  min(l,dx  +  d2). 

Proof:  These  results  follow  from  the  following  observations.  The  sum  will  be,  at  worst, 
dense: 

spaceaum  <  S£i(lgn)  <  f n2  -  1/3  <  fn2, 
and  is  no  larger  than  the  sum  of  the  space  occupied  by  the  addends:  shape 

spaceBum  <  space  j  4-  space2  <  SD[lgn)[di  +  d2). 

If  one  addend  is  the  negative  of  the  other,  then  the  space  for  the  sum,  NIL,  is  zero; 

spaceBuni  >  | space1  -  space2\  =  0  =  \n2\di  -  d2 1; 

otherwise  the  sum-tree  must,  at  least  contain  a  root: 

spaceBUm  >  | spacel  —  space2 j  +  1 

>  |(n2  -  l/4)|d!  -  d2\  +  1  >  fn2|di  -  d2\ 

Let  (proportional)  time  be  measured  by  the  number  of  nodes  visited  in  computing  the 
sum  quadtree;  only  the  bases  of  the  trees,  common  to  both  addends  need  be  traversed 
becaused  unshared  periphery  can  be  borrowed  in  the  sum.  The  number  of  nodes  visited 
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by  an  addition  has  is  zero,  at  best,  and  and  is  at  least  proportional  to  the  path  common 
to  the  two  addends: 


uniprocessortime  >  0; 

>  totalpathi  +  totalpathj  —  totalpathD; 

>  n2(lgn  +  1)  max(0, 1  -  (si  +  s2)). 

It  is  less  than  the  lesser  of  the  two  total  paths: 

uniprocessortime  <  min(totalpath1,totalpath2) 

<  n2(lgn  +  1)(1  -  max(si,s2)). 


The  density  bounds  follow  from  dividing  the  bounds  from  the  cases  considered  in  the 
space  analysis,  above,  by  SoOgn).  In  considering  sparsity,  we  must  consider  the  entire, 
unshared  path  length  in  the  sum: 

totalpathBUIn  >  0; 

<  n2(Ign  +  1); 

>  |totalpathj  -  totalpath2|; 

<  totalpathj  +  totalpath2. 

the  first  two  bounds  occur  when  the  sum  is  NIL  or  completely  dense,  respectively.  The 
second  pair  of  bounds  account  for  the  extreme  cases  where  the  addends  are  nearly  additive 
inverses,  or  have  no  coincident  non-zero  elements.  Since,  for  t  =  1,2: 


totalpath,-  =  n2(lgn  +  !)(!-».). 


I 


max(0,si  +  s2-l)  =  l-min(l,2-(si-t-s2))  <  sparsity  =  1  — totalpath  jj 

n2(lgn  +  1) 

In  general,  however,  analytical  results  like  these  are  difficult  and  so  the  utility  of  these 
measures  ultimately  must  be  established  or  denied  by  experimentation  on  real  data. 


Section  9.  Conclusions 

Measures  of  space  and  expected  depth  of  quadtree  representations  of  various  kinds 
of  “sparse”  matrices  have  been  presented.  All  measures  exceed  those  for  conventional 
representations,  in  most  cases  by  only  a  linear  factor.  Quadtree  representation,  however, 


offers  a  facility  for  process  decomposition  and  scheduling  that  is  unavailable  with  other 
representations,  and  so  the  increased  costs  may  easily  be  recoverable. 

These  measures  have  been  developed  analytically  from  familiar  cases.  Their  utility 
depends  upon  two  kinds  of  experimental  investigation  on  genuine  data.  The  two  questions 
to  be  addressed  are  whether  these  measures  separate  among  cases  that  actually  arise  in  real 
data,  and  whether  the  performance  of  real  algorithms  can  be  reflected  in  these  measures. 

Along  with  experimental  identification  of  populations  of  likely  patterns,  we  should 
seek  analytic  results,  particularly  of  average  cases  because  the  sparsity  measure  is  based 
on  an  expected  value.  Results  on  multiplication  and  matrix  inversion  or  solving  linear 
systems,  similar  to  those  given  here  in  Section  8,  would  be  quite  useful. 

While  the  measures  apply  as  well  to  data  stored  sequentially  on  uniprocessors,  it  is 
particularly  important  to  apply  these  measures  to  multiprocessor  algorithms  because  mul¬ 
tiprocessing  motivates  the  quadtree  representation  that  indirectly  motivates  them.  Only 
with  many  processors  can  the  overhead  inherent  in  this  representation  be  recovered. 

One  might  speculate  that  the  widespread  use  of  permutations  can  cloud  these  exper¬ 
iments.  It  is  not  appropriate  to  study  data  after  it  has  been  artificially  permuted  to  suit 
a  particular  uniprocessor  algorithm  or  architecture.  While  it  is  usual  to  permute  matrices 
on  uniprocessors  in  order  to  bring  the  data  into  a  desirable  pattern,  there  will  be  reason 
to  avoid  repeated  permutations  in  a  rich  multiprocessing  environment.  There  is  a  very 
simple  path  between  memory  and  processor  on  a  uniprocessor,  so  it  is  difficult  to  foresee 
addressing  patterns  hampering  its  effective  bandwidth. 

On  multiprocessors,  however,  one  can  anticipate  that  the  pattern  of  many  processors 
contending  for  access  to  many  memories  can  reduce  that  effective  bandwidth  in  extreme 
cases.  The  simplest  parallel  algorithm  that  can  raise  such  contention  is  an  unpatterned, 
random,  permutation  of  a  vector  stored  across  several  memory  banks.  It  is  difficult  to  be¬ 
lieve  that  each  bank  will  be  regularly  accessed  without  delay  by  other  processors  accessing 
the  same  portion  of  the  address  space.  Whether  the  poor  values  for  permutation  matrices 
is  a  reflection  of  this  problem  remains  to  be  proved. 

If  these  measures  are  a  harbinger  of  that  difficulty,  however,  they  also  point  to  a 
likely  alternative  to  heavy  use  of  permutations.  Perhaps,  it  will  be  desirable  to  avoid  wild 
permutations  (like  the  FFT)  on  parallel  processors,  in  favor  of  alternatives  like  factoring 


simpler  permutations  from  partial  results.  Although  the  FFT  permutation-pattern  mea¬ 
sured  poorly,  it  admits  a  factorization  of  Shuffles  (or  Deals)  each  of  which  measures  out  to 
be  comparatively  tame.  Perhaps  these  permutations  can  be  distributed  out  from  a  parallel 
process  to  a  serial  process  (like  input/output,  usually  on  uniprocessors)  where  they  would 
not  create  bandwidth-consuming,  chaotic  access.  Perhaps,  like  the  permutations  within 
the  fast  Fourier  transform  applied  to  convolution  problems,  these  lazy  permutations  might 
simplify  or  cancel  themselves  by  subsequent  associativity  with  other  postponed  permuta¬ 
tions. 

Acknowledgement:  The  final  draft  of  this  paper  was  assembled  using  facilities  at 
the  Computer  Science  Lab  of  Tektronix,  Inc.  Section  6  and  the  Appendix  belong  to  the 
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Appendix 


Lemma  1.  Ifb>  1  and  |aj  <  1  then  (1  —  a)6  >  1  —  ab. 

Proof:  Compare  the  Taylor  Series  expansion  of  the  logarithm  of  both  sides.| 


Lemma  2.  If  a  >  1  then  (1  —  x)“  <  1  —  ax  +  [ax)2 / 2. 

Proof:  Using  Taylor’s  Series  expansion  (around  zero)  with  remainder  [7,  p.  150] 
write 


(1  —  x)a  =  1  —  ax  ■+- 


a2(l  —  c)ax2 


where  0  <  c  <  x.  Setting  c  =  0  proves  the  lemma.  | 


Lemma  3.  For  all  x,y  >  0,  (l  —  x)y  >  e  xyKx  *). 

Proof:  It  suffices  to  show  that  —  ln(l  —  x)  <  x/[l  —  x).  But 


x2  x3  x4 


-  ln(l  -x)  =  x+  —  +  y+  — 


we  can 


and 


- - =  X  +  X2  +  X3  +  x4  +  ... 

1  —  X 

Straightforward  comparison  proves  the  lemma.  | 

Lemma  4.  For  all  x,y  >  0,  (1  —  x)xy  <  e-*3y. 

Proof:  Take  the  logarithm  of  both  sides,  apply  the  Taylor  Series  expansion  to  the  left  side 
and  compare  terms.  | 
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